NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Breaking the computation and communication abstraction barrier in distributed machine learning workloads

https://doi.org/10.1145/3503222.3507778

Jangda, Abhinav; Huang, Jun; Liu, Guodong; Sabet, Amir Hossein; Maleki, Saeed; Miao, Youshan; Musuvathi, Madanlal; Mytkowicz, Todd; Saarikivi, Olli (February 2022, ASPLOS 2022: Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems)

Recent trends towards large machine learning models require both training and inference tasks to be distributed. Considering the huge cost of training these models, it is imperative to unlock optimizations in computation and communication to obtain best performance. However, the current logical separation between computation and communication kernels in machine learning frameworks misses optimization opportunities across this barrier. Breaking this abstraction can provide many optimizations to improve the performance of distributed workloads. However, manually applying these optimizations requires modifying the underlying computation and communication libraries for each scenario, which is both time consuming and error-prone. Therefore, we present CoCoNet, which contains (i) a domain specific language to express a distributed machine learning program in the form of computation and communication operations, (ii) a set of semantics preserving transformations to optimize the program, and (iii) a compiler to generate jointly optimized communication and computation GPU kernels. Providing both computation and communication as first class constructs allows users to work on a high-level abstraction and apply powerful optimizations, such as fusion or overlapping of communication and computation. CoCoNet enabled us to optimize data-, model- and pipeline-parallel workloads in large language models with only a few lines of code. Our experiments show that CoCoNet significantly outperforms state-of-the-art distributed machine learning implementations.
more » « less
Full Text Available
Scalable FSM parallelization via path fusion and higher-order speculation

https://doi.org/10.1145/3445814.3446705

Qiu, Junqiao; Sun, Xiaofan; Sabet, Amir Hossein; Zhao, Zhijia (January 2021, Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’21))
null (Ed.)
Finite-state machine (FSM) is a fundamental computation model used by many applications. However, FSM execution is known to be “embarrassingly sequential” due to the state dependences among transitions. Existing solutions leverage enumerative or speculative parallelization to break the dependences. However, the efficiency of both parallelization schemes highly depends on the properties of the FSM and its inputs. For those exhibiting unfavorable properties, the former suffers from the overhead of maintaining multiple execution paths, while the latter is bottlenecked by the serial reprocessing among the misspeculation cases. Either way, the FSM parallelization scalability is seriously compromised. This work addresses the above scalability challenges with two novel techniques. First, for enumerative parallelization, it proposes path fusion. Inspired by the classic NFA to DFA conversion, it maps a vector of states in the original FSM to a new (fused) state. In this way, path fusion can reduce multiple FSM execution paths into a single path, minimizing the overhead of path maintenance. Second, for speculative parallelization, this work introduces higher-order speculation to avoid the serial reprocessing during validations. This is a generalized speculation model that allows speculated states to be validated speculatively. Finally, this work integrates different schemes of FSM parallelization into a framework—BoostFSM, which automatically selects the best based on the relevant properties of the FSM. Evaluation using real-world FSMs with diverse characteristics shows that BoostFSM can raise the average speedup from 3.1× and 15.4× of the existing speculative and enumerative parallelization schemes, respectively, to 25.8× on a 64-core machine.
more » « less
Full Text Available
Subway: minimizing data transfer during out-of-GPU-memory graph processing

https://doi.org/10.1145/3342195.3387537

Sabet, Amir Hossein; Zhao, Zhijia; Gupta, Rajiv (April 2020, Proceedings of the Fifteenth European Conference on Computer Systems)

Full Text Available

Search for: All records